Supp. Mat. 3 - Species distribution model

Training an SDM

This vignette builds upon concepts developed in the first two case studies to show how SDT can be used to train and apply a species distribution model.

In [1]:
# Load the required packages
using SpeciesDistributionToolkit
using CairoMakie
using Statistics
CairoMakie.activate!(px_per_unit = 6.0)
import Random
Random.seed!(12345678)
Random.TaskLocalRNG()

The next steps cover the same ground as the first two case studies, so we will not go into the details of the code:

In [2]:
# Get the polygon of interest
place = SpeciesDistributionToolkit.gadm("PRY")
extent = SpeciesDistributionToolkit.boundingbox(place)
provider = RasterData(CHELSA2, BioClim)
L = SDMLayer{Float16}[SDMLayer(provider; layer=l, extent...) for l in layers(provider)]
mask!(L, place);

We now get the species occurrence data from GBIF - these are the same as in the first case study. Note that, to follow GBIF best practices, we download this dataset using its DOI. This is recommended both to avoid sending multiple requests to the API and to ensure that the dataset can be properly cited. When using the downloaded version, the records are returned in the format of the OccurrencesInterface package, which is transparently usable by SpeciesDistributionToolkit.

In [4]:
# Get occurrence data from GBIF
tx = taxon("Akodon montensis")
presences = GBIF.download("10.15468/dl.d3cxpr")
Retrieved 562 occurrences

To generate pseudo-absence data, we start by projecting the occurrences to a layer. This results in a spatially thinned layer in which cells without an occurrence have a value of false, and those with at least one occurrence have a value of true.

In [4]:
# Pseudo-absence generation
presencelayer = mask(first(L), presences)
🗺️  A 998 × 1007 layer with 507665 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs
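The idea behind this projection step can be sketched outside of SDT. The following Python snippet (a hypothetical `rasterize_presences` helper, assuming a regular grid defined by its cell edges; this is not the SDT implementation) shows why multiple points falling in one cell collapse to a single true value, which is what thins the data spatially:

```python
import numpy as np

def rasterize_presences(lons, lats, lon_edges, lat_edges):
    """Project point occurrences onto a Boolean grid: a cell is True if
    it contains at least one occurrence. Several points in the same cell
    collapse to a single True, which thins the data spatially."""
    grid = np.zeros((len(lat_edges) - 1, len(lon_edges) - 1), dtype=bool)
    rows = np.digitize(lats, lat_edges) - 1
    cols = np.digitize(lons, lon_edges) - 1
    for r, c in zip(rows, cols):
        if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]:
            grid[r, c] = True
    return grid
```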

We now generate a layer storing the distance between each cell and the nearest observation. Distances are calculated using the haversine formula.

In [5]:
background = pseudoabsencemask(DistanceToEvent, presencelayer)
🗺️  A 998 × 1007 layer with 507665 Float64 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs
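For reference, the haversine formula itself is simple to state; a minimal Python sketch (not the SDT implementation, and assuming a mean Earth radius of 6371 km) is:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance (in km) between two points given in
    degrees, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of latitude works out to roughly 111.2 km under this approximation.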

We will now remove all cells less than 50 km from an observation. Pseudo-absences will not be generated in this range.

In [6]:
maskedbg = nodata(background, d -> d < 50)
heatmap(maskedbg)

We finally generate background points (six times as many as there are cells with occurrences), proportionally to their distance to an observation.

In [36]:
bgpoints = backgroundpoints(maskedbg, 6sum(presencelayer))
🗺️  A 998 × 1007 layer with 398843 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs
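The sampling logic behind distance-proportional background points can be illustrated with a short Python sketch (a hypothetical `sample_background` helper, assuming masked or excluded cells are stored as NaN; this is not the SDT implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_background(distance, n):
    """Draw n background cells without replacement, with probability
    proportional to each cell's distance to the nearest occurrence.
    NaN cells (masked, or too close to an occurrence) are never drawn."""
    flat = distance.ravel()
    valid = ~np.isnan(flat)
    p = np.where(valid, flat, 0.0)
    p = p / p.sum()
    idx = rng.choice(flat.size, size=n, replace=False, p=p)
    chosen = np.zeros(flat.size, dtype=bool)
    chosen[idx] = True
    return chosen.reshape(distance.shape)
```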

At this point, the data (shown over the temperature layer) look like this:

In [37]:
# Illustration
heatmap(L[1], colormap=:thermal)
lines!(place[1].geometry, color=:black)
scatter!(presencelayer, color=:white, strokecolor=:black, strokewidth=2, markersize=10, label="Virtual presences")
scatter!(bgpoints, color=:white, strokecolor=:grey, strokewidth=1, markersize=7, label="Pseudo-absences")
current_figure()

We now define the structure of the model: the data will be transformed through a PCA, then classified using a decision tree. Note that the model call accepts, in that order, the layers with covariates, the layer with presences, and the layer with background points:

In [38]:
# Specification of the SDM
sdm = SDM(PCATransform, DecisionTree, L, presencelayer, bgpoints)
PCATransform → DecisionTree → P(x) ≥ 0.5

As an important sidenote, SDeMo will never perform a PCA on the entire layer, as this would immediately create data leakage. The PCA is trained only on the presence data (as we have generated pseudo-absences through a heuristic), unless the absence data are specifically requested.

As we do not have a large volume of data, we will limit the number of nodes in the tree to twelve, with a maximal depth of five. This avoids trees that are too dense or too deep, and can protect against overfitting. Note that SDeMo always prunes trees after training.

In [39]:
maxnodes!(sdm, 12)
maxdepth!(sdm, 5)
PCATransform → DecisionTree → P(x) ≥ 0.5

We can now perform k-fold cross-validation using all layers we downloaded. The crossvalidate function returns two arrays of confusion matrices, which can then be passed to one of the (many) evaluation measures SDeMo supports. We look at the mean and 95% CI of the Matthews correlation coefficient, to ensure that the model is trainable, and that the tree does not overfit.

In [40]:
# Cross-validation
folds = kfold(sdm; k=10);
cv = crossvalidate(sdm, folds; threshold = true);
println("""
Validation: $(round(mcc(cv.validation); digits=2)) ± $(round(ci(cv.validation); digits=2))
Training: $(round(mcc(cv.training); digits=2)) ± $(round(ci(cv.training); digits=2))
""")
Validation: 0.8 ± 0.07
Training: 0.85 ± 0.03
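As a reminder of the measure being reported, the MCC can be computed directly from the four entries of a confusion matrix; a minimal Python version (not the SDeMo implementation) is:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the entries of a confusion
    matrix; ranges from -1 (total disagreement) to 1 (perfect
    prediction), with 0 for a no-skill classifier."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0
```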

Because we have likely included many irrelevant predictors, we will perform variable selection. We start by identifying a series of variables with low collinearity using the stepwise variance inflation factor, then perform forward selection within this pool:

In [41]:
noselection!(sdm, folds) # We reset the model to use all features
stepwisevif!(sdm, 100.0)
@info variables(sdm)
forwardselection!(sdm, folds; verbose = true)
cv = crossvalidate(sdm, folds);
println("""
Validation: $(round(mcc(cv.validation); digits=2)) ± $(round(ci(cv.validation); digits=2))
Training: $(round(mcc(cv.training); digits=2)) ± $(round(ci(cv.training); digits=2))
""")
┌ Info: [2, 4, 5, 8, 9, 10, 13, 14, 15, 16, 18, 19]
└ @ Main /home/tpoisot/Documents/ms_sdt_software/appendix/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_X33sZmlsZQ==.jl:3
┌ Info: Current performance: 1.4166908387143924e-17 - working over 1 variables
└ @ SDeMo /home/tpoisot/.julia/packages/SDeMo/a4i4c/src/variables/selection.jl:78
┌ Info: Current performance: 0.8895751160151599 - working over 2 variables
└ @ SDeMo /home/tpoisot/.julia/packages/SDeMo/a4i4c/src/variables/selection.jl:78
┌ Info: Current performance: 0.9033936755223563 - working over 3 variables
└ @ SDeMo /home/tpoisot/.julia/packages/SDeMo/a4i4c/src/variables/selection.jl:78
┌ Info: Current performance: 0.9096252342475142 - working over 4 variables
└ @ SDeMo /home/tpoisot/.julia/packages/SDeMo/a4i4c/src/variables/selection.jl:78
Validation: 0.91 ± 0.06
Training: 0.91 ± 0.01
┌ Info: Returning with [8, 1, 2] -- 0.9096252342475142
└ @ SDeMo /home/tpoisot/.julia/packages/SDeMo/a4i4c/src/variables/selection.jl:93
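The variance inflation factor underlying the stepwise procedure is easy to compute by hand; a Python sketch (not the SDeMo code, assuming centered numeric predictors with non-zero variance) is:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X: 1/(1 - R^2) of
    the regression of that column on all the others. Large values flag
    collinear predictors that stepwise VIF selection would drop."""
    X = X - X.mean(axis=0)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Independent predictors have a VIF close to one, while a predictor that is a near-linear combination of the others has a very large VIF.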

We can check which variables have been retained. They are printed out after being shuffled to prevent the temptation of looking for meaning in their order.

In [42]:
Random.shuffle(variables(sdm))
3-element Vector{Int64}:
 2
 8
 1

We can finally train the model using all the available training data:

In [43]:
train!(sdm)
PCATransform → DecisionTree → P(x) ≥ 0.006

Note that as part of the training, the threshold of the model (the score above which a prediction of “presence” is made) is automatically adjusted. The threshold can be turned on and off at prediction time (to get either the score or the range) using the threshold keyword.

The initial prediction of the model can be made by calling the predict method with the model followed by a vector of layers, in which case it will be returned as a layer. Other methods for predict exist, and they all accept many keywords.

In [44]:
# Initial prediction (decision tree)
fg, ax, hm = heatmap(predict(sdm, L; threshold=false), colormap=Reverse(:linear_gow_60_85_c27_n256))
lines!(place[1].geometry, color=:black)
scatter!(presencelayer, color=:white, strokecolor=:black, strokewidth=2, markersize=10, label="Virtual presences")
scatter!(bgpoints, color=:white, strokecolor=:grey, strokewidth=1, markersize=7, label="Pseudo-absences")
Colorbar(fg[1,2], hm)
fg

As we used a small tree, it may be beneficial to turn it into an ensemble model. We do so by creating a Bagging object with 20 copies of the model, and we additionally bag the features used for each tree:

In [45]:
ensemble = Bagging(sdm, 20)
bagfeatures!(ensemble)
{PCATransform → DecisionTree → P(x) ≥ 0.006} × 20

When training the ensemble, SDeMo will retrain every tree using the assigned variables and observations:

In [46]:
train!(ensemble)
{PCATransform → DecisionTree → P(x) ≥ 0.006} × 20

As this is a bagging ensemble, we can calculate the out-of-bag error (one minus the accuracy):

In [47]:
outofbag(ensemble) |> (x) -> 1-accuracy(x)
0.03147699757869249
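The mechanics of out-of-bag evaluation can be illustrated without any SDM machinery. This Python sketch (a hypothetical `bootstrap_oob` helper, not the SDeMo code) records which rows each bootstrap sample leaves out; those rows form a test set the corresponding model never saw, and on average about 1 - 1/e ≈ 63% of the rows are in-bag for any given model:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_oob(n_samples, n_models):
    """For each bagged model, draw a bootstrap sample (with replacement)
    and record which observations were drawn. Rows where in_bag is False
    are out-of-bag for that model, and can serve as its test set."""
    in_bag = np.zeros((n_models, n_samples), dtype=bool)
    for m in range(n_models):
        draw = rng.integers(0, n_samples, size=n_samples)
        in_bag[m, draw] = True
    return in_bag

inbag = bootstrap_oob(1000, 20)
frac = inbag.mean()  # close to 1 - 1/e ≈ 0.632
```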

When predicting from an ensemble, the additional keyword consensus can be used to specify how the predictions from each tree are combined. Note that any model SDeMo knows about can be used as part of an ensemble, and the discussion here is not specific to trees. The consensus function needs to accept an array and return a value; the default is the median. SDeMo provides the built-in iqr (inter-quartile range) and majority (class recommended by the most trees). We can plot the median prediction of the model:

In [48]:
prediction = predict(ensemble, L; threshold=false)
fg, ax, hm = heatmap(prediction, colormap=:linear_worb_100_25_c53_n256)
lines!(place[1].geometry, color=:black)
Colorbar(fg[1,2], hm)
current_figure()

The code to plot the areas where a majority of trees predict that the species may be present is very similar: note that we use threshold=true to work on Boolean predictions, and consensus=majority to get the most frequently recommended class.

In [49]:
modelrange = predict(ensemble, L; threshold=true, consensus=majority)
fg, ax, hm = heatmap(modelrange, colormap=:Greens)
lines!(place[1].geometry, color=:black)
current_figure()

Before moving on, we can visualize the uncertainty in the ensemble model:

In [50]:
uncertainty = predict(ensemble, L; threshold=false, consensus=iqr)
fg, ax, hm = heatmap(uncertainty, colormap=:matter)
lines!(place[1].geometry, color=:black)
Colorbar(fg[1,2], hm)
current_figure()

We will now demonstrate how this prediction can be clipped by the elevational range of the species. We get the elevation information from another data provider (WorldClim v2):

In [51]:
# Get the elevation data
elevprov = RasterData(WorldClim2, Elevation)
elev = convert(SDMLayer{Float16}, SDMLayer(elevprov; resolution=0.5, extent...))
🗺️  A 998 × 1007 layer with 1004986 Float16 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

As the elevation data comes from another provider, there is a chance that it will not be compatible with the environmental layers we have. This can be due to a different CRS, or to a different spatial resolution or cell coordinates. For this reason, we will first create a new layer that is similar to the covariates:

In [52]:
elevation = similar(L[1], Float16)
🗺️  A 998 × 1007 layer with 507665 Float16 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

We now interpolate the original elevation layer to the new one. Interpolation currently relies on bilinear filtering, but we are in the process of expanding the method to support arbitrary functions. Note that the interpolation takes care of bringing the data to the correct CRS.

In [53]:
interpolate!(elevation, elev)
🗺️  A 998 × 1007 layer with 1002986 Float16 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs
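Bilinear filtering itself is straightforward; a minimal Python sketch (not the SDT implementation, assuming a regular grid and in-bounds fractional coordinates, and ignoring the CRS transformation) is:

```python
import numpy as np

def bilinear(grid, x, y):
    """Bilinear interpolation of a regular grid at fractional
    coordinates (x along columns, y along rows): the value is a
    distance-weighted average of the four surrounding cells."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, grid.shape[1] - 1)
    y1 = min(y0 + 1, grid.shape[0] - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * grid[y0, x0] + tx * grid[y0, x1]
    bot = (1 - tx) * grid[y1, x0] + tx * grid[y1, x1]
    return (1 - ty) * top + ty * bot
```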

We can now mask the elevation data using the polygon for Paraguay:

In [54]:
mask!(elevation, place)
heatmap(elevation, colormap=:terrain)

As in the first case study, we can read the elevation for known occurrences directly, and feed this to the extrema function to get the limits of the altitudes at which the species has been observed:

In [55]:
# Get the elevation range
elevation_range = extrema(elevation[presences])
(Float16(60.0), Float16(458.5))

We can create a mask specifying where the altitude is within the range of the species:

In [56]:
in_elev = (e -> (elevation_range[1] <= e <= elevation_range[2])).(elevation)
🗺️  A 998 × 1007 layer with 507665 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

Finally, we create a layer with the range according to the ensemble model:

In [57]:
in_range = predict(ensemble, L; threshold=true, consensus=majority)
🗺️  A 998 × 1007 layer with 507665 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

Because these are both Boolean layers, we can apply the “and” operator to identify areas that have the correct environment and elevation:

In [64]:
final_range = in_range .& in_elev
🗺️  A 998 × 1007 layer with 507665 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs
In [65]:
f = Figure(; size=(600, 600))
ax = Axis(f[1,1], aspect=DataAspect())
heatmap!(ax, final_range, colormap=:Greens)
lines!(ax, place[1].geometry, color=:black)
scatter!(ax, presencelayer, color=:white, strokecolor=:black, strokewidth=1, markersize=10, label="Observed presences")
scatter!(ax, bgpoints, color=:lightgrey, strokecolor=:darkgrey, strokewidth=1, markersize=5, label="Pseudo-absences")
axislegend(ax, position=:lb, framevisible=false)
tightlimits!(ax)
hidespines!(ax)
hidedecorations!(ax)
f
Figure 1: Predicted range of Akodon montensis in Paraguay based on a rotation forest trained on GBIF occurrences and the BioClim variables. The predicted range is clipped to the elevational range of the species. The code to produce this figure is available as Supp. Mat. 3.

Because the model is trained, we can reuse it to, e.g., calculate the variable importance (which SDeMo does through non-parametric bootstrap in a model-agnostic way):

In [66]:
vi = variableimportance(ensemble, montecarlo(sdm, n=10); threshold=false, samples=100)
vi ./= sum(vi)
vnames = ["BIO$(x)" for x in variables(ensemble)]
collect(zip(vnames, vi))
3-element Vector{Tuple{String, Float64}}:
 ("BIO8", 0.6904721914010669)
 ("BIO1", 0.3053564282929377)
 ("BIO2", 0.004171380305995421)

We can identify the variable that was the most important overall (and its relative importance):

In [68]:
mix = findmax(vi)
(0.6904721914010669, 1)

It corresponds to:

In [69]:
mostimp = layerdescriptions(provider)[layers(provider)[variables(ensemble)[last(mix)]]]
"Mean Temperature of Wettest Quarter"

In the last part of this vignette, we will compare two techniques to explain the predictions made by the ensemble model.

Partial responses are commonly used in SDMs, and represent the prediction from a model when all variables but one are averaged (regular partial responses) or sampled at random from their distribution (inflated partial responses). These provide information about the way in which the model responds to changes in a variable of interest, compared to the “background” of all other variables.
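A regular partial response is easy to sketch outside SDeMo; in Python (a hypothetical `partial_response` helper, holding every other variable at its mean while the variable of interest sweeps over a grid):

```python
import numpy as np

def partial_response(model, X, j, grid):
    """Regular partial response of variable j: all other variables are
    held at their mean while j sweeps over `grid`; the curve is the
    model's prediction along that sweep."""
    base = X.mean(axis=0)
    curve = []
    for v in grid:
        row = base.copy()
        row[j] = v
        curve.append(model(row[None, :])[0])
    return np.array(curve)
```

An inflated partial response would instead redraw the other variables at random from their observed distributions for each evaluation, producing a family of curves rather than a single one.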

SDeMo has code to perform the calculation of Shapley values. The explain method will perform the Monte-Carlo analysis on a pre-set number of samples. In contrast to partial responses, Shapley values are calculated at the scale of a single prediction, and represent the departure from the average model prediction due to the specific value of the variable of interest for this prediction. References in the main text provide illustrations and explanations of how Shapley values can be used. To make sure that the output is comparable to other partial responses, we add the average prediction of the model to the Shapley values:

In [70]:
shapval = explain(ensemble, variables(sdm)[mix[2]]; threshold=false, samples=100)
shapresp = shapval .+ mean(predict(ensemble; threshold=false))
si = clamp.(shapresp, 0, 1)
hist(si)
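The Monte-Carlo estimation behind such Shapley values can be sketched as follows (a Python illustration in the style of sampling-based Shapley approximation; `shapley_mc` is a hypothetical helper, not the SDeMo implementation): for each sample, a random background observation and a random feature ordering are drawn, and the change in prediction when the feature of interest takes the explained observation's value is averaged.

```python
import numpy as np

rng = np.random.default_rng(2)

def shapley_mc(model, X, x, j, samples=100):
    """Monte-Carlo Shapley value of feature j for one observation x:
    draw a random background row and a random feature ordering, give
    the features preceding j the values from x, then average the change
    in prediction when x[j] replaces the background value of feature j."""
    n, p = X.shape
    total = 0.0
    for _ in range(samples):
        order = rng.permutation(p)
        z = X[rng.integers(n)].copy()
        pos = np.where(order == j)[0][0]
        for k in order[:pos]:  # features preceding j take x's values
            z[k] = x[k]
        with_j = z.copy()
        with_j[j] = x[j]
        total += model(with_j[None, :])[0] - model(z[None, :])[0]
    return total / samples
```

For a linear model, this estimate converges to the familiar closed form: the coefficient times the departure of x[j] from the background mean.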

Finally, we will plot the Shapley responses, overlaid on top of the partial and inflated partial responses for this variable. This figure is presented in the main text:

In [78]:
fig = Figure(; size = (600, 360))
ax = Axis(fig[1,1], xlabel=mostimp, ylabel="Response (presence score)")
for _ in 1:200
    lines!(ax, partialresponse(ensemble, variables(sdm)[mix[2]]; threshold=false, inflated=true)..., color=:lightgrey, alpha=0.5, label="Infl. partial response")
end
lines!(ax, partialresponse(ensemble, variables(sdm)[mix[2]]; threshold=false)..., color=:red, label="Partial response", linestyle=:dash, linewidth=2)
scatter!(ax, features(ensemble, variables(sdm)[mix[2]]), si, label="Shapley values", color=labels(ensemble), colormap=:Greens, strokecolor=:black, strokewidth=1)
tightlimits!(ax)
ylims!(ax, 0, 1)
axislegend(ax, position=:rt, framevisible=false, merge=true, unique=true)
fig
Figure 2: Partial responses (red) and inflated partial responses (grey) to the most important variable. In addition, the Shapley values for all training data are presented in the same figure; green points are presences, and pale points are pseudo-absences. Shapley values were added to the average model prediction to be comparable to partial responses. The code to produce this figure is available as Supp. Mat. 3.